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Consider the problem of estimating a multivariate normal mean with a known variance matrix, 
which is not necessarily proportional to the identity matrix. The coordinates are shrunk directly 
in proportion to their variances in Efron and Morris’ (J. Amer. Statist. Assoc. 68 (1973) 117- 
130) empirical Bayes approach, whereas inversely in proportion to their variances in Berger’s 
(Ann. Statist. 4 (1976) 223-226) minimax estimators. We propose a new minimax estimator, 
by approximately minimizing the Bayes risk with a normal prior among a class of minimax 
estimators where the shrinkage direction is open to specification and the shrinkage magnitude is 
determined to achieve minimaxity. The proposed estimator has an interesting simple form such 
that one group of coordinates are shrunk in the direction of Berger’s estimator and the remaining 
coordinates are shrunk in the direction of the Bayes rule. Moreover, the proposed estimator is 
scale adaptive: it can achieve close to the minimum Bayes risk simultaneously over a scale class 
of normal priors (including the specified prior) and achieve close to the minimax linear risk over 
a corresponding scale class of liyper-rectangles. For various scenarios in our numerical study, 
the proposed estimators with extreme priors yield more substantial risk reduction than existing 
minimax estimators. 

Keywords: Bayes risk; empirical Bayes; minimax estimation; multivariate normal mean; 
shrinkage estimation; unequal variances 


1. Introduction 

A fundamental statistical problem is shrinkage estimation of a multivariate normal mean. 
See, for example, the February 2012 issue of Statistical Science for a broad range of theory, 
methods, and applications. Let X = (X\,... ,X p ) T be multivariate normal with unknown 
mean vector 9= (9i,... ,9 p ) T and known variance matrix E. Consider the problem of 
estimating 9 by an estimator 5 = S(X) under the loss L(S, 9) = (S — 9) T Q(5 — 9), where 
Q is a known positive definite, symmetric matrix. The risk of S is R(S,9) = Eg{L(S,9)}. 
The general problem can be transformed into a canonical form such that E is diagonal 
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and Q = I, the identity matrix (e.g., Lehmann and Casella [21], Problem 5.5.11). For 
simplicity, assume except in Section 3.2 that E is D = diag(di,..., d p ) and L(S,9) = 
||J — 9 1| 2 , where ||a:|| 2 = x T x for a column vector x. The letter D is substituted for E to 
emphasize that it is diagonal. 

For this problem, we aim to develop shrinkage estimators that are both minimax and 
capable of effective risk reduction over the usual estimator Jo = X even in the het- 
eroscedastic case (i.e., di,... ,d p are not equal). An estimator of 9 is minimax if and only 
if, regardless of 9 £ R p , its risk is always no greater than )Cj=i dj, the risk of <5o - For 
p > 3, minimax estimators different from and hence dominating So are first discovered 
in the homoscedastic case where D = a 2 I (i.e., d\ = ■ • • = d p = a 2 ). James and Stein 
[19] showed that = (1 — c<t 2 /||X|| 2 )AT is minimax provided 0 < c < 2 (p — 2). Stein 
[26] suggested the positive-part estimator J;] s+ = (1 — c<r 2 /|| X\\ 2 ) + X, which dominates 
Throughout, a+ =max(0,a). Shrinkage estimation has since been developed into 
a general methodology with various approaches, including empirical Bayes (Efron and 
Morris [17]; Morris [22]) and hierarchical Bayes (Strawderman [28]; Berger and Robert 
[7]). While these approaches are prescriptive for constructing shrinkage estimators, min- 
imaxity is not automatically achieved but needs to be checked separately. 

For the heteroscedastic case, there remain challenging issues on how much observations 
with different variances should be shrunk relatively to each other (e.g., Casella [15], 
Morris [22]). For the empirical Bayes approach (Efron and Morris [17]), the coordinates 
of X are shrunk directly in proportion to their variances. But the existing estimators 
are, in general, non-minimax (i.e., may have a greater risk than the usual estimator 
Jo)- On the other hand, Berger [3] proposed minimax estimators, including admissible 
minimax estimators, such that the coordinates of X are shrunk inversely in proportion 
to their variances. But the risk reduction achieved over Jo is insubstantial unless all the 
observations have similar variances. 

To address the foregoing issues, we develop novel minimax estimators for multivariate 
normal means under heteroscedasticity. There are two central ideas in our approach. The 
first is to develop a class of minimax estimators by generalizing a geometric argument es¬ 
sentially in Stein [25] (see also Brandwein and Strawderman [11]). For the homoscedastic 
case, the argument shows that J^ s can be derived as an approximation to the best linear 
estimator of the form (1 — A)X, where A is a scalar. In fact, the optimal choice of A in 
minimizing the risk is pa 2 /Eg(\\X\\ 2 ). Replacing Eg(||Af|| 2 ) by ||Af|| 2 leads to J^ s with 
c = p. This derivation is highly informative, even though it does not yield the optimal 
value c = p— 2. 

Our class of minimax estimators are of the linear form (/ — A A)X, where A is a 
nonnegative definite, diagonal matrix indicating the direction of shrinkage and A is a 
scalar indicating the magnitude of shrinkage. The matrix A is open to specification, 
depending on the variance matrix D but not on the data X. For a fixed A , the scalar A 
is determined to achieve minimaxity, depending on both D and X. Berger’s [3] minimax 
estimator corresponds to the special choice A = D~ X , thereby leading to the unusual 
pattern of shrinkage discussed above. 

The second idea of our approach is to choose A by approximately minimizing the 
Bayes risk with a normal prior in our class of minimax estimators. The Bayes risk is 
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used to measure average risk reduction for 9 in an elliptical region as in Berger [4, 
5]. It turns out that the solution of A obtained by our approximation strategy has an 
interesting simple form. In fact, the coordinates of X are automatically segmented into 
two groups, based on their Bayes “importance” (Berger [5]), which is of the same order 
as the coordinate variances when the specified prior is homoscedastic. The coordinates of 
high Bayes “importance” are shrunk inversely in proportion to their variances, whereas 
the remaining coordinates are shrunk in the direction of the Bayes rule. This shrinkage 
pattern may appear paradoxical: it may be expected that the coordinates of high Bayes 
“importance” are to be shrunk in the direction of the Bayes rule. But that scheme is 
inherently aimed at reducing the Bayes risk under the specified prior and, in general, fails 
to achieve minimaxity (i.e., it may lead to even a greater risk than the usual estimator 


In addition to simplicity and minimaxity, we further show that the proposed estimator 
is scale adaptive in reducing the Bayes risk: it achieves close to the minimum Bayes 
risk, with the difference no greater than the sum of the 4 highest Bayes “importance” 
of the coordinates of X , simultaneously over a scale class of normal priors (including 
the specified prior). To our knowledge, the proposed estimator seems to be the first one 
with such a property in the general heteroscedastic case. Previously, in the homoscedastic 
case, 2 is known to achieve the minimum Bayes risk up to the sum of 2 (equal-valued) 
Bayes “importance” of the coordinates over the scale class of homoscedastic normal priors 
(Efron and Morris [17]). 

The rest of this article is organized as follows. Section 2 gives a review of existing esti¬ 
mators. Section 3 develops the new approach and studies risk properties of the proposed 
estimator. Section 4 presents a simulation study. Section 5 provides concluding remarks. 
All proofs are collected in the Appendix. 

2. Existing estimators 

We describe a number of existing shrinkage estimators. See Lehmann and Casella [21] for 
a textbook account and Strawderman [29] and Morris and Lysy [23] for recent reviews. 
Throughout, tr(-) denotes the trace and A max (-) denotes the largest eigenvalue. Then 
tr(-D) = Ej= i d j and A max (71) = max(di,... ,d p ). 

For a Bayes approach, assume the prior distribution: 9 ~ N( 0 , 7 J), where 7 is the prior 
variance. The Bayes rule is given componentwise by <5^ ayes = {l — dj/(dj +7 )}Xj. Then 
the greater dj is, the more Xj is shrunk whether 7 is fixed or estimated from the data. For 
the empirical Bayes approach of Efron and Morris [17], 7 is estimated by the maximum 
likelihood estimator 7 such that 



(1) 
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Morris [22] suggested the modified estimator 


rEB 


p- 2 dj 
P ^ + 7 + 


Xi 


( 2 ) 


In our implementation, the right-hand side of (1) is computed to update 7 from the 
initial guess, p~ 1 {Ylj^i(X 2 — dj)}+, for up to 100 iterations until the successive absolute 
difference in 7 is <10 -4 , or 7 is set to 00 so that <5 EB = X otherwise. 

Alternatively, Xie et al. [31] proposed empirical Bayes-type estimators based on mini¬ 
mizing Stein’s [27] unbiased risk estimate (SURE) under heteroscedasticity. Their basic 
estimator is defined componentwise by 


< 3 > 

where 7 is obtained by minimizing the SURE of ,5 Bayes , that is, SURE( 7 ) = X T D{D + 
r yI}~ 1 X + 27 tr{D(£> + 7 /) -1 } — tr(D). In general, the two types of empirical Bayes 
estimators, <5 EB and (f XKB , are non-minimax, as shown in Section 4. 

For a direct extension of <5 4S , consider the estimator = (1 — c/||X|| 2 )X and, more 
generally, Sf = {1 - r(\\X\\ 2 )/\\X\\ 2 }X, where c is a scalar constant and r(-) a scalar 
function. See Lehmann and Casella [21], Theorem 5.7, although there are some typos. 
Both <5® and <5® are spherically symmetric. The estimator <5| is minimax provided 


0 < c < 2{tr(H) — 2A max (D)}, (4) 

and (5® is minimax provided 0 < r(-) < 2{tr(£>) — 2A max (D)} and r(-) is nondecreasing. 
No such c ^ 0 exists unless tr (D) > 2X max (D), which restricts how much (di ,... ,d p ) can 
differ from each other. For example, condition (4) fails when p= 10 and 


d\ — 40, o ?2 — 20, g ?3 — 10, d 4 — • • • — dio — 1, (5) 


because tr (D) = 77 and A max (H) = 40. 

Berger [3] proposed estimators of the form <5 B = {I — cD -1 /(X T D~ 2 X)}X and <5 B = 
{I — r(X T D~ 2 X)/(X T D~ 2 X)D~ 1 }X, where c is a scalar constant and r(-) a scalar 
function. Then <5 B is minimax provided 0 < c < 2(p — 2), and <5 B is minimax provided 0 < 
r(-) < 2 (p — 2) and r(-) is nondecreasing, regardless of differences between (di, ..., d p ). 
However, a striking feature of <5 B and i5 B , compared with d EB and <5 XKB , is that the smaller 
dj is, the more Xj is shrunk. For example (5), under <5 B , the coordinates (Xi,X 2 ,X 3 ) 
are shrunk only slightly, whereas (X 4 ,..., X 10 ) are shrunk as if they were shrunk as a 
7-dimensional vector under <5 4S . The associated risk reduction is insubstantial, because 
the risk of estimating ( 04 ,..., 0 io) is a small fraction of the overall risk of estimating 9. 

Define the positive-part version of 5 B componentwise as 



cd • 1 \ 

_ 3 \ V 

X T D~ 2 Xj + 3 


(6) 
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The estimator 6®+ dominates <5® by Baranchik [1], Section 2.5. Berger [6], Equation 
(5.32), stated a different positive-part estimator, 5 B with r(t) = min(p — 2, t), but the jth 
component may not be of the same sign as Xj. 

Given a prior 9 ~ N(0,T), Berger [5] suggested an approximation of Berger’s [4] robust 
generalized Bayes estimator as 


5 rb = 


I — min< 1, 


V -2 


X T (D + T)~ 1 X j 


D(D + ry 


x. 


(7) 


The estimator is expected to provide significant risk reduction over So = X if the prior is 
correct and be robust to misspecification of the prior, but it is, in general, non-minimax. 
In the case of T = 0, <5 RB becomes {1 — (p—2)/(X T D~ 1 X)} + X, in the form of spherically 
symmetric estimators = {1 — r(X T D~ 1 X)/(X T D~ 1 X)}X, where r(-) is a scalar 
function (Bock [10], Brown [12]). The estimator (5® s is minimax provided 0 < r(-) < 
2{tr(Z?)/A max (H) — 2} and r(-) is nondecreasing. Moreover, if tr(D) < 2A max (H), then 
<5 r ss is non-minimax unless r(-) = 0. 

To overcome the non-minimaxity of (5 RB , Berger [5] developed a minimax estima¬ 
tor <5 MB by combining 5 B , <5 RB , and a minimax estimator of Bhattacharya [9]. Sup¬ 
pose that T = diag( 7 i,... ,y p ) and the indices are sorted such that d\ > • • ■ > d*, where 
d* =dPj/(dj + 7 j). Define <5 MB componentwise as 


Sf B = X 0 - 


1 


V 

-^J2( d *k^d* k+ i)min 
i k=j 


(k- 2 ). 


Ylt=iXt/(dt + 7 ^) 


-Xi 


■7 3 


( 8 ) 


where d p+1 = 0. In the case of T = 0, ^ MB reduces to the original estimator of Bhat¬ 
tacharya [9] . The factor (k — 2) + is replaced by 2(k — 2)+ in Berger’s [5] original definition 
of (5 MB , corresponding to replacing p — 2 by 2(p—2) in (5 RB . In our simulations, the two 
versions of <5 MB somehow yield rather different risk curves, and so do the corresponding 
versions of other estimators. But there has been limited theory supporting one version 
over the other. Therefore, we focus on comparisons of only the corresponding versions of 
(f MB and other estimators. 


3. Proposed approach 

We develop a useful approach for shrinkage estimation under heteroscedasticity, by mak¬ 
ing explicit how different coordinates are shrunk differently. The approach not only sheds 
new light on existing results, but also lead to new minimax estimators. 


3.1. A sketch 

Assume that E = D (diagonal) and Q — I. Consider estimators of the linear form 


6 = (I- A A)X = X- XAX, 


(9) 









6 


Z. Tan 


where A is a nonnegative definite, diagonal matrix indicating the direction of shrink¬ 
age and A is a scalar indicating the magnitude of shrinkage. Both A and A are to be 
determined. A sketch of our approach is as follows. 

(i) For a fixed A, the optimal choice of A in minimizing the risk is 

tr(DA) 

opt E e (X T A T AX)' 

(ii) For a fixed A and a scalar constant c > 0, consider the estimator 

Q 

$ A , c = X — x T A T AX AX 

By Theorem 1, an upper bound on the risk function of 5a, c is 


R(6 A ,c,e)<tv(D) + Eg 


c{c-2c*(D 1 A)}' 
X T A T AX ’ 


( 10 ) 


where c*(D,A) = tr (DA) — 2A max (-DA). Requiring the second term to be no 
greater than 0 shows that if c*(D,A) > 0, then S AjC is minimax provided 

0 < c < 2c*(D, A). (11) 


If c*(D,A) > 0, then the upper bound (10) has a minimum at c = c*(D,A). 

(iii) By taking c= c*(D,A) in 5 AjC , consider the estimator 


8 a = X 


c*{D,A) 

X T A T AX 


AX 


subject to c*(D,A) > 0, so that S A is minimax by step (ii). A positive-part esti¬ 
mator dominating 6 A is defined componentwise by 


)i = 



c*( y D,A)a 3 \ 
X T A T AX J + 3 ' 


( 12 ) 


where (ai,..., a p ) are the diagonal elements of A. The upper bound (10) on the 
risk functions of 5 A and 5\, subject to c*(D,A) > 0, gives 


R(5 A ,e)<tr{D)-E 9 


c* 2 (P, A) \ 
X T A T AX J 


(13) 


We propose to choose A based on some optimality criterion, such as minimizing 
the Bayes risk with a normal prior centered at 0 (Berger [5]). 

Further discussions of steps (i)—(iii) are provided in Sections 3.2-3.3. 
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We first develop steps (i)—(ii) for the general problem where neither E nor Q may be 
diagonal. The results can be as concisely stated as those just presented for the canonical 
problem where E is diagonal and Q = I. Such a unification adds to the attractiveness of 
the proposed approach. 

Consider estimators of the form (9), where A is not necessarily diagonal, but 

AYi is nonnegative definite. (14) 

Condition (14) is invariant under a linear transformation. To see this, let B be a non¬ 
singular matrix and E* = UE1? T and A* = BAB -1 . For the transformed problem of 
estimating 6 * = B9 based on X* = BX with variance matrix E*, the transformed esti¬ 
mator from (9) is 5* = X* — A A*X*. The application of condition (14) to 6 * says that 
T*E* = BAEB t is nonnegative definite and therefore is equivalent to (14) itself. For the 
canonical problem where E = D (diagonal), condition (14) only requires that AD is non¬ 
negative definite, allowing A to be non-diagonal. On the other hand, it seems intuitively 
appropriate to restrict A to be diagonal. Then condition (14) is equivalent to saying that 
A is nonnegative definite (and diagonal), which is the condition introduced on A in the 
sketch in Section 3.1. 

The risk of an estimator of the form (9) is 
E e {(X - 6 - A AX) t Q(X - 6 - A AX)} 

= E e {{X - 9) T Q{X - 0)} + X 2 E 9 (X t A t QAX) - 2A E 9 {(X - 6 ) T QAX}. 


For a fixed A, the optimal A in minimizing the risk is 

, E e {(X-9) T QAX} _ tr(E QA) 
opt E 9 (X t A t QAX) E 9 (X t A t QAX)' 


Replacing E 9 (X T A T QAX ) by X T A T QAX and tr(EQAl) by a scalar constant c > 0 leads 
to the estimator 


Sa,c x x t a t qax ax ' 


For a generalization, replacing c by r(X T A T QAX) with a scalar function r(-) > 0 leads 
to the estimator 


8 a,t =X- 


r(X T A T QAX) 
X T A T QAX 


We provide in Theorem 1 an upper bound on the risk function of 5a,t- 


Theorem 1 . Assume that r(-) almost differentiable (Stein [27]). If (14) holds and r(-) > 
0 is nondecreasing, then for each 9, 


R(6 A ,r,0)<tv^Q) + E 9 


r{r — 2c* (E, Q,A)} ' 
X T A T QAX \ ’ 


(15) 
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where r = r(A T A T QAX) and c*(E,Q, A) = tr(AE<3) — A max (AEQ + E A T Q). Taking 
r(-) = c>0 in (15) gives an upper bound on R(5a,c,0). 

Requiring the second term in the risk upper bound (15) to be no greater than 0 leads 
to a sufficient condition for 5a, r to be minimax. 

Corollary 1 . If (If) holds and c*(E ,Q,A) > 0, then 5 a, r is minimax provided 

0<r(-)<2c*(E,Q,A) and r(-) is nondecreasing. (16) 

Particularly, 5a,c is minimax provided 0 < c< 2c*(E,Q, A). 


For the canonical problem, inequality (15) and condition (16) for 5a,c give respectively 
(10) and (11). These results generalize the corresponding ones for <5® and 5f in Section 2, 
by the specific choices A = I or H -1 . The generalization also holds if c is replaced by 
a scalar function r(-) > 0. In fact, condition (16) reduces to Baranchik’s [2] condition in 
the homoscedastic case. 

If c*(T,,Q,A) > 0, then the risk upper bound (15) has a minimum at r(-) = c = 
c*(E,Q, A). As a result, consider the estimator 


S A = X 


c*(E ,Q,A) 
X T A T QAX 


which is minimax provided c*(E, Q , A) > 0. If A = <5 _1 E _1 (Berger [3]), then c*(E, Q , A) = 
p— 2 and, by the proof of Theorem 1 in the Appendix, the risk upper bound (15) becomes 
exact for 5a,c- Therefore, for A = E -1 , the estimator 5a = $A,p- 2 is uniformly best 

in the class 5a , c , in agreement with the result that 2 is uniformly best among in 
the homoscedastic case. 

The estimator 5a has desirable properties of invariance. First, 5a is easily shown to be 
invariant under a multiplicative transformation A <—> aA for a scalar a > 0. Second, 5a is 
invariant under a linear transformation of the inference problem. Similarly as discussed 
below (14), let B be a nonsingular matrix and E* = BT,B t , Q* = B T 1 QB~ 1 , and 
A* = BAB~ X . For the transformed problem of estimating 9* = B9 based on X* = BX , 
the transformed estimator from 5a is X* — {c*(E,<5, A)/(X* T A* T Q*A*X*)}A*X*, 
whereas the application of 5a is X* — {c*(E*, Q*, A*)/{X* T A* T Q*A*X*)}A*X*. The 
two estimators are identical because A*T,*Q* = BAYiQB~ x , E *A* T Q* = BT,A t QB~ 1 , 
and hence c*(E*, Q*, A*) = c*( E, Q , A). 

Finally, we present a positive-part estimator dominating 5a in the case where both 
AE and QA are symmetric, that is, 


AE = EA t and QA = A T Q. 


(17) 


Similarly to (14), it is easy to see that this condition is invariant under a linear trans¬ 
formation. Condition (17) is trivially true if E, Q , and A are diagonal. In the Appendix, 
we show that (17) holds if and only if there exists a nonsingular matrix B such that 
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Q = B T B, £ = B~ 1 DB t \ and A = B~ 1 A*B, where D and A* are diagonal and the 
diagonal elements of D or A* are, respectively, the eigenvalues of T,Q or A. In the fore¬ 
going notation, £* = D and Q* = I. For the problem of estimating 6 * = B 8 based on 
X* = BX , consider the estimator rj = X ~ {c*(D 1 A*)/( K X* T A* T A*X*)}A*X and the 
positive-part estimator r/ + with the jth component, 


1 - 


c*(D,A*) 

X* t A* t A*X* 


\r * 


where (aj,..., a*) are the diagonal elements of A*. The estimator r] + dominates r] by a 
simple extension of Baranchik [1], Section 2.5. By a transformation back to the original 
problem, r] yields 8 a , whereas r] + yields 


5\ = B 1 diag 


c*(S,Q,A) * 

X T A^QAX 0,1 


+ 


f c*(£,Q,A) * 

\ X' T A T QAX ap 


BX. 


-N 


Then 8 \ dominates 5a- Therefore, (15) also gives an upper bound on the risk of 8 \, with 
r(-) = c*(£, Q , A ), even though 5\ is not of the form 8a, r - 

In practice, a matrix A satisfying (17) can be specified in two steps. First, find a 
nonsingular matrix B such that Q = B T B and £ = B~ 1 DB T 1 , where D is diagonal. 
Second, pick a diagonal matrix A* and define A = B~ 1 A*B. The first step is always 
feasible by taking B = OC , where C is a nonsingular matrix such that Q = C T C and 
O is an orthogonal matrix O such that 0(CEC T )0 T is diagonal. Given (S,Q) and D, 
it can be shown that A and dj depend on the choice of A*, but not on that of B , 
provided that a* = a* k if dj = c4 for any j, k = 1,... ,p. In the canonical case where £ = D 
and Q — I, this condition amounts to saying that any coordinates of X with the same 
variances should be shrunk in the same way. 


3.3. Constructing estimators: Step (iii) 

Different choices of A lead to different estimators 5a and 5\. We study how to choose 
A, depending on (£,Q) but not on X , to approximately optimize risk reduction while 
preserving minimaxity for 5a- The estimator <5^ provides even greater risk reduction than 
5a- We focus on the canonical problem where £ = D (diagonal) and Q = I. Further, we 
restrict A to be diagonal and nonnegative definite. 

As discussed in Berger [4] , any estimator can have significantly smaller risk than 5q = X 
only for 6 in a specific region. Berger [4, 5] considered the situation where significant risk 
reduction is desired for an elliptical region 

{9: (18) 

with fj, and T the prior mean and prior variance matrix. See <5 RB and <5 MB reviewed in 
Section 2. To measure average risk reduction for 6 in region (18), Berger [5] used the 
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Bayes risk with the normal prior 0~N(/n,r). For simplicity, assume throughout that 
/i = 0 and r = diag( 7 i,... , 7 P ) is diagonal. 

We adopt Berger’s [5] ideas of specifying an elliptical region and using the Bayes risk to 
quantify average risk reduction in this region. We aim to find A, subject to c*(D,A) > 0, 
minimizing the Bayes risk of d A with the prior 717 , 9 ~ N(0,r), 

R{6 A ,TT T ) = E^E 0 (\\5 A -e\\ 2 ), 


where E ’ I ' r denotes the expectation with respect to the prior 717 . Given A, the risk 
R(5 A ,irr) can be numerically evaluated. A simple Monte Carlo method is to repeatedly 
draw 6 ~ N(0,r) and X\9 ~ N (9,D) and then take the average of ||<5 a(A’) — 9 1| 2 . But 
it seems difficult to literally implement the foregoing optimization. Alternatively, we 
develop a simple method for choosing A by two approximations. 

First, if c*(D, A) > 0, then taking the expectation of both sides of (13) with respect to 
the prior 717 gives an upper bound on the Bayes risk of 6 A \ 

R(6 A ,irr) < tr(£>) — E m \^ (19) 


where E m denotes the expectation with respect to the marginal distribution of X in the 
Bayes model, that is, X ~ N(0,H + F). An approximation strategy for choosing A is to 
minimize the upper bound (19) on the Bayes risk or to maximize the second term. The 
expectation E m {(X T A T AX)~ x } can be evaluated as a 1-dimensional integral by results 
on inverse moments of quadratic forms in normal variables (e.g., Jones [20]). But the 
required optimization problem remains difficult. 

Second, approximations can be made to the distribution of the quadratic form 
X T A T AX. Suppose that X T A T AX is approximated with the same mean by {Y7j = i(dj + 
7 j)&j}Xp/P> where Xp is a chi-squared variable with p degrees of freedom. Then 
E m {(X T A T AX) -1 } is approximated by {p/(p — 2)}{X^j= 1 (^i +7 j) a 'j}~ 1 ■ We show in 
the Appendix that this approximation gives a valid lower bound: 


Em {x^A^AX ) - p - 2 ' + T7K 2 ‘ (20) 

A direct application of Jensen’s inequality shows that E m {(X T A T AX)~ 1 } > {52 F j=\{dj + 
7 j)o^} _ 1 . But the lower bound (20) is strictly tighter and becomes exact when (di + 
7 i)af = • • • = (d p + 7 P )a p . No simple bounds such as (20) seem to hold if more complicated 
approximations (e.g., Satterthwaite [24]) are used. 

Combining (19) and (20) shows that if c*(D,A) > 0, then 


R ( S A , 717) < tr(D) 


P c* 2 (D,A) 

P~ 2 J2 P j=i(dj +7j)« 2 ' 


( 21 ) 


Notice that 6 A is invariant under a multiplicative transformation A 1 —» aA for a scalar 
a > 0, and so is the upper bound (21). Our strategy for choosing A is to minimize 
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the upper bound (21) subject to c*(D,A) > 0 or, equivalently, to solve the constrained 
optimization problem: 


p 

max c*(D,A)=y djOj — 2 max djCLj 
j=i 

p 

subject to + 7 j)aj = fixed. 

j=i 


( 22 ) 


The condition c*(D,A) > 0 is dropped, because for p > 3, the achieved maximum is at 
least c* ( D , all -1 ) = a(p — 2) > 0 for some scalar a > 0. In spite of the approximations used 
in our approach, Theorem 2 shows that not only the problem (22) admits a non-iterative 
solution, but also the solution has a very interesting interpretation. For convenience, 
assume thereafter that the indices are sorted such that d\/(d\ + 71 ) > d\/(d 2 + 72 ) > 

''' — d P /(dp + 7p) • 


Theorem 2. Assume thatp >3 , D = diag(di,..., d p ) with dj > 0 and T = diag( 7 i,..., 7 P ) 
with 7 j >0 (j = 1,... ,p). For problem (22), assume that A = diag(ai, ..., a p ) with aj > 0 
(j = l,...,p) and i(dj + 7 j) a j = E^=i d V ( d j + 7 j)> satisfied by a 3 = dj/(dj + 7 ,-)• 
Then the following results hold. 

(i) There exists a unique solution, A f = diag(a{,... ,a|,), to problem (22). 

(ii) Let v be the largest index such that d u a\, = max(dia|,... ,d p aj,). Then o>3, 
d\a\ = • ■ ■ = d„aj, > djOj for j > v + 1, and 


J _ 


=ME 


\k =1 


dk+7k \ v-2 
dl ) di 


(j = l,...,u), 


a\=K v 


dj + 7 j 


(j = v + l,...,p), 


where K v = {E^=i d2 jK d j + 7 j)} 1 / 2 M„ and 


M v 


(v-2fi 


ELi(di+7i)/d? 


V 

- E ; 

j=v+1 


d : 


■71 


T/ie achieved maximum value, c*(D,A^), is K V M V (> 0). 

(iii) The resulting estimator 8 a t minimax. 


We emphasize that, although T can be considered a tuning parameter, the solution A^ 
is data independent , so that 8 a t is automatically minimax. If a data-dependent choice 
of A were used, minimaxity would not necessarily hold. This result is achieved both 
because each estimator 8 a with c*(D,A) > 0 is minimax and because a global criterion 
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(such as the Bayes risk) is used, instead of a pointwise criterion (such as the frequentist 
risk at the unknown 6), to select A. By these considerations, our approach differs from 
the usual exercise of selecting a tuning parameter in a data-dependent manner for a class 
of candidate estimators. 

There is a remarkable property of monotonicity for the sequence (M 3 , M 4 ,. .., M p ), 
which underlies the uniqueness of v and . 

Corollary 2. The sequence (M 3 , M 4 ,... , M p ) is nonincreasing: for 3 < k < p — 1, M\ > 
Mfc_|_i, where the equality holds if and only if 


k 2 __ ^fc+i 


T, k j= M + 7 j)/d 2 j d k +1 + 7fc+i 

The condition d v a\, > d v +ia \ +1 is equivalent to saying that the left side is greater than 


the right-hand side in the above expression for k = u. Therefore, v is the smallest index 
3 < k < p — 1 with this property, and > M u+ 1 . 

The estimator 5^ is invariant under scale transformations of A^. Therefore, the con¬ 
stant K v can be dropped from the expression of A 1 in Theorem 1. 


Corollary 3. The solution A 1 = diag(a{,..., aj)) can be rescaled such that 




(23) 



(j = u + l,...,p). 


(24) 


Then c*(D,A' i ) = X0j=i a ] ( d j + 7 j) = Moreover, it holds that 



(25) 


The estimator can be expressed as 



( 26 ) 


The foregoing results lead to a simple algorithm for solving problem (22): 
(i) Sort the indices such that d\/(d\ + 71 ) > • • • > d p / ( d p + y p ). 
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(ii) Take v to be the smallest index k (corresponding to the largest M k ) such that 
3 < k < p — 1 and 

_ k ~ 2 _ > d l+i 

T, k j=i(dj + 7 j)/dj + ^+i ’ 

or take v =p if there exists no such k. 

(iii) Compute (aj,...,aj) by (23)-(24). 

This algorithm is guaranteed to find the (unique) solution to problem (22) by a fixed 
number of numerical operations. No iteration or convergence diagnosis is required. There¬ 
fore, the algorithm is exact and non-iterative, in contrast with usual iterative algorithms 
for nonlinear, constrained optimization. 

The estimator 8 A \ bas an interesting interpretation. By (23)-(24), there is a di¬ 
chotomous segmentation in the shrinkage direction of the coordinates of X based on 
d* = c(j/(dj + 7 j). This quantity d* is said to reflect the Bayes “importance” of 6 j, that 
is, the amount of reduction in Bayes risk obtainable in estimating Oj in Berger [5]. The 
coordinates with high d* are shrunk inversely in proportion to their variances dj as in 
Berger’s [3] estimator J®, whereas the coordinates with low d* are shrunk in the direction 
of the Bayes rule. Therefore, 5a t mimics the Bayes rule to reduce the Bayes risk, except 
that 5^ mimics 6 ® for some coordinates of highest Bayes “importance” in order to 
achieve minimaxity. In fact, by inequality (25), the relative shrinkage, a'j/{dj/(dj + 7 j)}, 
of each Xj (j = 1,... ,u) in 5 A i versus the Bayes rule is always no greater than that of 
X k (k = u + l,...,p). 

The expression (26) suggests that there is a close relationship in beyond the shrinkage 
direction between S A t and the Bayes rule under the Bayes model, X ~ N(0, D + T). 
In this case, £ m (X^ =1 °J X?) = YJj=i a ] ( d j +7 j), and hence S A i behaves similarly 
to X — A^X. Therefore, on average under the Bayes model, the coordinates of X are 
shrunk in the same as in the Bayes rule, except that some coordinates of highest 
Bayes “importance” are shrunk no greater than in the Bayes rule. While this discussion 
seems heuristic, we provide in Section 3.4 a rigorous analysis of the Bayes risk of , 
compared with that of the Bayes rule. 

We now examine for two types of priors: 71 = • • • = 7 P = 7 and 7 j = 7 dj (j = 
referred to as the homoscedastic and hcteroscedastic priors. For both types, 
(c££,..., d*) are of the same order as the variances (d \,..., d p ). Recall that 5 A is invariant 
under a multiplicative transformation of A. For both the homoscedastic prior with 7 = 0 
and the heteroscedastic prior regardless of 7 > 0 , the solution A^ = diag(aj,..., a) p ) can 
be rescaled such that 




v-2 



a j = 1 (j = z/+!,...,p). 


(j = l,...,v), 
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Denote by A J this rescaled matrix A\ corresponding to T = 0. Then coordinates with 
high variances are shrunk inversely in proportion to their variances, whereas coordinates 
with low variances are shrunk symmetrically. For T = 0, the proposed method has a purely 
frequentist interpretation: it seeks to minimize the upper bound ( 21 ) on the pointwise 
risk of 5a at 9 = 0. 

For the homoscedastic prior with 7 —> 00 , the proposed method is then to minimize the 
upper bound (21) on the Bayes risk of 5a with an extremely flat, homoscedastic prior. 
As 7 —> 00 , the solution A' can be rescaled such that 

«!=(XX -2 ) ^ (J = l, 

a j= d J (j = JZ+l,...,p). 

Denote by A ^ this rescaled matrix A t. Then coordinates with low (or high) variances are 
shrunk directly (or inversely) in proportion to their variances. The direction A ^ can also 
be obtained by using a fixed prior in the form 7 j = "/d± — dj (j = 1 ,.. . ,p) for arbitrary 
7 > 1 , where d± =maxj—i^__ tP dj. 

Finally, in the homoscedastic case (d\ = • • • = d p = er 2 ), if the prior is also homoscedas¬ 
tic (71 = • ■ ■ = 7p = 7 ), then v = p, a{ = • ■ ■ = a],, and 5At reduces to the James-Stein 
estimator S p ^_ 2 j regardless of a 2 and 7 . 

3.4. Evaluating estimators 

The estimator 5At is constructed by minimizing the upper bound (21) on the Bayes risk 
subject to minimaxity. In addition to simplicity, interpretability, and minimaxity demon¬ 
strated for 5At , it remains important to further study risk properties of 5At and show 
that SAt can provide effective risk reduction over 5q = X. Write = chtt(r) whenever 
needed to make explicit the dependency of A t on T. 

First, we study how close the Bayes risk of ^qr) can be to that of the Bayes rule, 
which is the smallest possible among all estimators including non-minimax ones, under 
the prior 7 Tr, 9 ~ N(0,r). The Bayes rule (5p ayes is given componentwise by (^r ay6S )l = 
{1 — dj/(dj +7 j)}Xj 1 with the Bayes risk 

i?(cip ayes , 7r r ) = tr(D) - ^ d*, 
j =1 

where d* = d 2 /(dj + 7 ?), indicating the Bayes “importance” of Qj (Berger [5]). The upper 
bound (21) on the Bayes risk of (5^t(r) gives 

fl{W>.'r} <tr(D) - ^M„ = tr(D) - _^| + ^d*J, (27) 
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because c*(D, A t ) = Y%=i(dj +lj) a ] = and hence c* 2 (D, ^4 t )/{^-i(^'+7?') a ] } = 
M v by Corollary 3. It appears that the difference between i?{<$At(r)> 7Tr} and I?(dp ayes , 
7 rr) tends to be large if v is large. But d* > ■ ■ ■ > d* cannot differ too much from each 
other because by Corollary 1, 

k-2<J2^< k (fc = 3,1). 
i=i 1 

Then the difference between -R{< 5 J 4 t(r)) 7 r r} and i?( 5 p ayes , 7 Tr) should be limited even if v 
is large. A careful analysis using these ideas leads to the following result. 


Theorem 3. Suppose that the prior is 9 ~ N(0,T). If v = 3, then 

«{«X.(T),W} < tr(D) -Y.d‘ + U ~ - TZ ^ 


.7—3 

P 


p-2^ J p — 2 3 

j=4 


<tr(£)-^d* + -^. 


.7=3 


If v> 4, t/ien 


<MD)- X>; + K + <5 - tAjXA - - :(r 


J—3 
P 


p — 2 J p — 2 v 

3 —3 


<tr(D)-^d* + (^ + ^). 

i=3 


Throughout, an empty summation is 0. 


(28) 

(29) 


(30) 

(31) 


There are interesting implications of Theorem 3. By (29) and (31), 

^{<W),7Tr} < i?(<5p ayes ,7r r ) + (dl + d * 2 + d * 3 + d%). (32) 

Then 5 J 4 t(r) achieves almost the minimum Bayes risk if d\/{tr{D) — Y^=i dj} ~ 0- In 
terms of Bayes risk reduction, the bound (32) shows that 

tr(-D) - R{5 AHr) ,n r } > (l - ^ +g +^+' d *^ {tr(J ) _ i?(<^ )7r p)}. 

Therefore, (r) achieves Bayes risk reduction within a negligible factor of that achieved 
by the Bayes rule if d\f d* ~ 0. 

In the homoscedastic case where both D = a 2 1 and T = 7 /, <5^ reduces to 6 ^ 2 , 
regardless of 7 > 0 (Section 3.3). Then the bounds (28) and (30) become exact and 
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give Efron and Morris’s [17] result that R{5^_ 2 ,7 t 7 j) = tr (D) — (p — 2){cr 4 /(cr 2 + 7 )} or 
equivalently tr(£>) - R{6%L 2 ,7 t t j) = (1 - 2/p){ti{D) - R(5®j yes , tt 7 /)}. 

It is interesting to compare the Bayes risk bound of with that of the following 

simpler version of Berger’s [5] estimator d MB : 


6f B2 = Xj- 


— Vf d*-d* ) ( k ~ T>+ \ 

d ih k Eti x !/(dt + 'rt)l d j +'rj 


J X r 


By Berger [5], d MB2 is minimax and 


d* (E 1 


fe= 1 


^ MB2 , -r) = tr(D) - E^- 2 Ef l-73lE 
1=3 1=3 J \ J 

<tr (D)-J2d* r 

3 =3 


(33) 

(34) 


There seems to be no definite comparison between the bounds (28) and (30) on 
R{d A Hr),^r} and the exact expression (33) for R(S MB2 : 7 rr), although the simple bounds 
(29) and (31) is slightly higher, by at most + d%, than the bound (34). Of course, each 
risk upper bound gives a conservative estimate of the actual performance, and compar¬ 
ison of two upper bounds should be interpreted with caution. In fact, the positive-part 
estimator <Tj[ t yields lower risks than those of the non-simplified estimator d MB in our 
simulation study (Section 4). 

The simplicity of 6 ^t and dj[ t makes it easy to further study them in other ways than 
using the Bayes (or average) risk. No similar result to the following Theorem 4 has been 
established for <5 MB or d MB2 . Corresponding to the prior N(0,T), consider the worst-case 
(or maximum) risk 

R(S, "Hr) = sup R(S,8) 
een r 

over the hyper-rectangle "Hr = {8- 8 2 < 7 j, j = 1,... ,p} (e.g., Donoho et al. [16]). Apply¬ 
ing Jensen’s inequality to (13) shows that if c*(D,A) > 0, then 

R(6 a , 8) < tr (D) - J dj ’ + 02 )a 2' 


which immediately leads to 


R{6 A ,H r )<ti(D) 


c* 2 (D, A) 

E: Mj • A)"-' 


(35) 


By the discussion after (20), a direct application of Jensen’s inequality to (19) shows 
that the Bayes risk R(6 A ,TTr) is also no greater than the right-hand side of (35), whereas 
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inequality (20) leads to a strictly tighter bound (21). Nevertheless, the upper bound (35) 
on the worst-case risk of 5At ( F ) gives 

R{S Am ,Hr}<tT(D)-M v =tT(D)- 1 + £ d*\, 

{ 1 a j j=v+ 1 J 

similarly as how (21) leads to (27) on the Bayes risk of <5 J 4t(r)- Therefore, the following 
result holds by the same proof of Theorem 3. 

Theorem 4. Suppose that Hr = {9- 0? < 7 j,j = 1,. .. ,p}. If v = 3, then 

p 0 

-R{<^4t(r) > Rr} < t r(D) - ^ d* + -d%. 


If v>A, then 


R{S A Hr),n r } < tr (D) - j^d* + (d* 3 + d* 4 - 4 

.7—3 ^ V ' 

<tr {D)-Y^d* + {dl + d\). 

J=3 

There are similar implications of Theorem 4 to those of Theorem 3. By Donoho et 
al. [16], the minimax linear risk over Hr, R L {Hr) = inf,5ii n ear-R(5,'Hr), coincides with 
the minimum Bayes risk i?(<5p ayes , 7Tr), and is no greater than 1.25 times the minimax 
risk over Hr, R N (Hr) = inf^ R(S, Hr)- These results are originally obtained in the ho- 
moscedastic case (d\ =••• = d p ), but they remain valid in the heteroscedastic case by 
the independence of the observations Xj and the separate constraints on 9j. Therefore, 
a similar result to (32) holds: 

i?{^ t( r),^r} < R L (H r ) + (d\ +d* 2 +d* 3 + d* 4 ) 

< 1.25R N (Hr) + K + d* 2 + d* 3 + d* 4 ). 

K dl/{tr(D) - EU d*} w 0, then S A t achieves almost the minimax linear risk (or the 
minimax risk up to a factor of 1.25) over the hyper-rectangle Hr, in addition to being 
globally minimax with 9 unrestricted. 

The foregoing results might be considered non-adaptive in that <5^7 (r) is evaluated with 
respect to the prior N(0,T) or the parameter set Hr with the same T used to construct 
5 J 4t(r) • But, by the invariance of 6 a under scale transformations of A, S A t(r) is identical to 
the estimator, <5 J 4t(r a )) that would be obtained if T is replaced by r a = a(D + T) — D for 
any scalar a such that the diagonal matrix r a is nonnegative definite. By Theorems 3-4, 
this observation leads directly to the following adaptive result. In contrast, no adaptive 
result seems possible for <5 MB . 
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Corollary 4. Let T a = a(D + T) — D and ao = maxj = i ,, jP {dj/(dj + 7 ^)} (< 1). Then 
for each a > ao, 


max[i?{5 j 4 t(r)) 7 r r a },-R{<5At(r)^r c ,}] < i?( 5 p^ yes , 7 r rct ) + a 1 (d{ +d^ + d^ + d\) 

= R L {Hr a ) + a- 1 {d* 1 +d * 2 + d * 3 + d * i ), 

where i?(<5p^ yes ,7rr Q ) = tr(D) — a ~ 1 Y^j= 1 dj. 

For fixed T, i^qr) can achieve close to the minimum Bayes risk or the minimax linear 
risk with respect to each prior in the class (N(0,r a ): a > ao} or each parameter set 
in the class { TLr a '■ a > ao} under mild conditions. For illustration, consider the case 
of a heteroscedastic prior with T ex D. Then {r a : a > ao} can be reparameterized as 
{"fD: 7 > 0}. By Corollary 4, for each 7 > 0, 

max{i?(<5 A t,7r 7 D),i?(^ j4 t,'H 7 D)} < R(S^ es ,% jD ) + dl + d ^ + d3+d4 , 

where R(8^ es , , k 1 d) = { 7 /( 1 + 7 )} tr(Z?) and d\ > d 2 > • • • > d p . Therefore, if d\/ tr(£>) ~ 
0, then 5 A \ achieves the minimum Bayes risk, within a negligible factor, under the prior 
N(0,7-D) for each 7 > 0. This can be seen as an extension of the result that in the 
homoscedastic case, 6^2 asymptotically achieves the minimum Bayes risk under the 
prior N(0,7 1) for each 7 > 0 as p —> 00. 

Finally, we compare the estimator 5 A t with a block shrinkage estimator, suggested by 
the differentiation in the shrinkage of low- and high-variance coordinates by 8 A \ ■ Consider 
the estimator 


cblock _ f d?- 2 (Xi , • ■ ■, X T ) 1 

U B _ T _ 2 (x T+1 ,...,x p )J’ 

where r is a cutoff index, and 8 f(Y) = Y if Y is of dimension 1 or 2. The index r can 
be selected such that the coordinate variances are relatively homogeneous in each block. 
Alternatively, a specific strategy for selecting r is to minimize an upper bound on the 
Bayes risk of (5 block , similarly as in the development of 8 A \ . Applying (21) with A = D~ 1 
to Sp _ 2 in the two blocks shows that R(S b>ock , 717) < tr(H) — L r , where 

k—2 p—k— 2 

(1 A) Ej=i ( d j + 7 j)/d) + (!/(P - fc )) Ej= fe+ i (dj + 7 j)/d ]' 

The first (or second) term in Lk is set to 0 if k < 2 (or k > p — 2). Then r can be defined 
as the smallest index such that L T = max(Li, L 2 , • • ■, L p ). But the upper bound (27) on 
R( 8 A i , tit) is likely to be smaller than the corresponding bound on R(S hlock , nr), because 
{k/(k — 2 )}Mk > Lk for each k > 3 by the Cauchy-Schwarz inequality ()Cj_fc+i d 2 j/(dj + 
Tj)}{Y^j=k+i(dj +7i)/^j} — (P~k) 2 . Therefore, S A t tends to yield greater risk reduction 
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than ^ block . This analysis also indicates that can be advantageous over (5 block extended 
to multiple blocks. 

The rationale of forming blocks in t and <5 block differs from that in existing block 
shrinkage estimators (e.g., Brown and Zhao [13]). As discussed in Cai [14], block shrink¬ 
age has been developed mainly in the homoscedastic case as a technique for pooling 
information: the coordinate means are likely to be similar to each other within a block. 
Nevertheless, it is possible to both deal with heterogeneity among coordinate variances 
and exploit homogeneity among coordinate means within individual blocks in our ap¬ 
proach using a block-homoscedastic prior (i.e., the prior variances are equal within each 
block). This topic can be pursued in future work. 


4. Simulation study 

4.1. Setup 

We conduct a simulation study to compare the following 8 estimators, 

(i) Non-minimax estimators: 6 EB by (2), <5 XKB by (3), S RB by (7) with T = 0; 

(ii) Minimax estimators: <5 B _j~ 2 by (6), <5 MB by (8) with T = 0 or 7/ for some large 7, 

8 \ by (12) with A = A\ and Al 0 . 

Recall that Aj corresponds to F = 0 or F oc D and A^ corresponds to T = 7/ with 
7 — > 00. In contrast, letting the diagonal elements of T tend to 00 in any direction in 
5 RB and <5 MB leads to So = X. Setting T to 0 or 00 is used here to specify the relevant 
estimators, rather than to restrict the prior on 9. 

For completeness, we also study the following estimators: ^p_ 2 ) by (6), <5 RB with 
p — 2 replaced by 2 (p — 2) in (7), 5 MB with (k — 2) + replaced by 2 (k — 2) + in (8), and 
St with c*(D,A) replaced by 2 c*{D,A) in (12), referred to as the alternative versions of 
8 p*. 2, <5 RB , <5 MB , and S\ respectively. The usual choices of the factors, p— 2, (k — 2) + , 
and c*(D,A), are motivated to minimize the risks of the non-positive-part estimators, 
but may not be the most desirable for the positive-part estimators. As seen below, the 
alternative choices 2(p — 2), 2(k — 2) + , and 2c* (D, A) can lead to risk curves for the 
positive-part estimators rather different from those based on the usual choices (p — 2), 
(k — 2) + , and c*(D,A). Therefore, we compare the estimators <5 B + 2 , d RB , <5 MB , and S\ 
and, separately, their alternative versions. 

Each estimator 5 is evaluated by the pointwise risk function R(S, 9) as 9 moves 
in a certain direction or the Bayes risk function -R(<5,7r) as 7r varies in a set of pri¬ 
ors on 9. Consider the homoscedastic prior N(0,?y 2 //p) or the heteroscedastic prior 
N{0 ,t/ 2 D/tr(D)} for 77 > 0. As discussed in Section 3.3, the Bayes risk with the first or 
second prior is meant to measure average risk reduction over the region {9: ||0|| 2 < ?7 2 } or 
{9: 9 t D~ 1 9 < pp 2 / tr(D)}. Corresponding to the two priors, consider the direction along 
(77/ y/p, ... ,77/ y/p) or (py/di, ... ^rfy/dp)/y/ti(D) , where 77 gives the Euclidean distance 
from 0 to the point indexed by rj. The two directions are referred to as the homoscedastic 
and heteroscedastic directions. 
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We investigate several configurations for D, including (5) and 

(rfi.rfa.dio) = (40,20,10,5,5,5,1,1,1,1) or (36) 

= (40,20,10,7,6,5,4,3,2,1) or (37) 

= 5%, 15%,..., 95% quantiles of 8/X3 or 24/xJ, 

where xt is a chi-squared variable with k degrees of freedom. In the last case, {d \,..., dio) 
can be considered a typical sample from a scaled inverse chi-squared distribution, which 
is the conjugate distribution for normal variances. In the case (36), the coordinates may 
be segmented intuitively into three groups with relatively homogeneous variances. In the 
case (37), there is no clear intuition about how the coordinates should be segmented into 
groups. 

For fixed D , the pointwise risk R(5, 9) is computed by repeatedly drawing X ~ N(0, D ) 
and then taking the average of ||d — 6 \\ 2 . The Bayes risk is computed by repeatedly 
drawing 9 ~ N(0,T) and X\9 ~ N (9,D) and then taking the average of ||<5 — 9 1| 2 . Each 
Monte Carlo sample size is set to 10 5 . 

4.2. Results 

The relative performances of the estimators are found to be consistent across different 
configurations of D studied. Moreover, the Bayes risk curves under the homoscedastic 
prior are similar to the pointwise risk curves along the homoscedastic direction. The 
Bayes risk curves under the heteroscedastic prior are similar to the pointwise risk curves 
along the heteroscedastic direction. Figure 1 shows the pointwise risks of the estimators 
with the usual versions of <5®+ 2 i <5 RB j £ MB , and and Figure 2 shows those of the 
estimators with the alternative versions of <S , d RB , <5 MB , and 8 \ for the case (36), with 
roughly three groups of coordinate variances, which might be considered unfavorable 
to our approach. For both 4J and 4^, the cutoff index v is found to be 3. See the 
supplementary material (Tan [30]) for the Bayes risk curves of all these estimators for 
the case (36) and the results for other configurations of D. 

A number of observations can be drawn from Figures 1-2. First, d EB , <5 XKB , and 
d RB have among the lowest risk curves along the homoscedastic direction. But along the 
heteroscedastic direction, the risk curves of d EB and d XKB rise quickly above the constant 
risk of X as r] increases. Moreover, all the risk curves of 5 EB , d XKB , and d RB along the 
9\ axis exceed the constant risk of X as |0i| increases. Therefore, <5 EB , (5 XKB , and <5 RB 
fail to be minimax, as mentioned in Section 2. 

Second, 5 B jt 2 or <5 B p_ 2 ^ has among the highest risk curve, except where the risk curves 
of d EB and d XKB exceed the constant risk of X along the heteroscedastic direction. 
The poor performance is expected for d B jt 2 or , because there are considerable 

differences between the coordinate variances in (36). 

Third, among the minimax estimators, 6 ^ with A = Aq or 4^ has the lowest risk curve 
along various directions, whether the usual versions of S^ 2 ■ ^ MB , and 8 ^ are compared 
(Figure 1) or the alternative versions are compared (Figure 2). 
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Figure 1. Pointwise risks along the homoscedastic (first row) and heteroscedastic (second row) 
directions and 9\ axis (third row) in the case (36). Left: non-minimax estimators <5 EB (V), <5 RB 
(Y), <5 xkb (a). Right: minimax estimators <5 B _t 2 (a), <5 MB with P = 0 (•) and T = (16 2 /p)J (o), 
with A = Aq (■) and ^4^ (□). 
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Pointwise risk (8j=r|/Vp) Pointwise risk (9j=riA/p) 



Pointwise risk (8,Pointwise risk (9, °c Jdj) 




Pointwise risk (8, axis) Pointwise risk (8, axis) 




Figure 2. Pointwise risks along the homoscedastic (first row) and heteroscedastic (second row) 
directions and 9 1 axis (third row) in the case (36), with the same legend as in Figure 1. The 
alternative versions of <5 B _t 2 , <5 RB , <S MB , and 5^ are used. 
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Fourth, the risk curve of 8 \ with A = Aj is similar to that of c> ^ with A = A^ along the 
heteroscedastic direction. But the former is noticeably higher than the latter along the 
homoscedastic direction as 77 increases, whereas is noticeably lower than the latter along 
the Q\ axis as \ 8 \\ increases. These results agree with the construction of A J using a het¬ 
eroscedastic prior and AJ^ using a flat, homoscedastic prior. Their relative performances 
depend on the direction in which the risks are evaluated. 

Fifth, <5 MB with T = 0 has risk curves below that of or but either above 

or crossing those of 8 \ with A = Aj and A Moreover, 5 MB with T = (16 2 /p)I has 
elevated, almost flat risk curves for 77 from 0 to 16. This seems to indicate an undesirable 
consequence of using a non-degenerate prior for d MB in that the risk tends to increase 
for 6 near 0, and remains high for 6 far away from 0. 

The foregoing discussion involves the comparison of the risk curves as 8 moves away 
from 0 between <5 MB and <Tj( t specified with fixed priors. Alternatively, we compare the 
pointwise risks at 8 = {r]/y/P> ■ ■ ■ iV/y/p) or (vVdii ■ ■ ■ ,77y / d^)/y / tr(U) and the Bayes risks 
under the prior N(0, r] 2 I/p) or N{0,77 2 Z)/ tr(H)} between <5 MB and 5 specified with the 
prior N(0,77 2 //p) for a range of 77. The homoscedastic prior used in the specification of 
(5 MB and <Tj[ t can be considered correctly specified or misspecihed, when the Bayes risks 
are evaluated under, respectively, the homoscedastic or heteroscedastic prior or when the 
pointwise risks are evaluated along the homoscedastic or heteroscedastic direction. For 
each situation, has lower pointwise or Bayes risks than (5 MB . See Figure A2 in the 

supplementary material (Tan [30]). 


5. Conclusion 

The estimator 8 A \ and its positive-part version <Tj( t are not only minimax and but also 
have desirable properties including simplicity, interpretability, and effectiveness in risk 
reduction. In fact, <5^ is defined by taking A = A^ in a class of minimax estimators 8 a- 
The simplicity of 8 A t holds because 8 a is of the linear form (/ — A A)X, with A and A 
indicating the direction and magnitude of shrinkage. The interpretability of S A t holds 
because the form of A^ indicates that one group of coordinates are shrunk in the direction 
of Berger’s [3] minimax estimator whereas the remaining coordinates are shrunk in the 
direction of the Bayes rule. The effectiveness of in risk reduction is supported, in 
theory, by showing that 8 can achieve close to the minimum Bayes risk simultaneously 
over a scale class of normal priors (Corollary 4). For various scenarios in our numerical 
study, the estimators i5^ t with extreme priors yield more substantial risk reduction than 
existing minimax estimators. 

It is interesting to discuss a special feature of 8 A ,r and hence of 8a,c and 8a among 
linear, shrinkage estimators of the form 

S = X ~h{X T BX)AX, (38) 

where A and B are nonnegative definite matrices and h(-) is a scalar function. The esti¬ 
mator 8a, r corresponds to the choice B oc A T QA, which is motivated by the form of the 
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optimal A in minimizing the risk of (/ — A A)X for fixed A. On the other hand, Berger 
and Srinivasan [8] showed that under certain regularity conditions on h(-), an estimator 
(38) can be generalized Bayes or admissible only if B oc X _1 A. This condition is incom¬ 
patible with B oc A t QA, unless Aoc. <3 _1 X _1 as in Berger’s [3] estimator. Therefore, 5 a 
including 5a] is, in general, not generalized Bayes or admissible. This conclusion, how¬ 
ever, does not apply directly to the positive-part estimator , which is no longer of the 
linear form (/ — A A)X. 

There are various topics that can be further studied. First, the prior on 6 is fixed, 
independently of data in the current paper. A useful extension is to allow the prior to be 
estimated within a certain class, for example, homoscedastic priors iV(0,7/), from the 
data, in the spirit of empirical Bayes estimation (e.g., Efron and Morris [17]). Second, the 
Bayes risk with a normal prior is used to measure average risk reduction in an elliptical 
region (Section 3.3). It is interesting to study how our approach can be extended when 
using a non-normal prior on 0, corresponding to a non-clliptical region in which risk 
reduction is desired. 


Appendix 

Preparation. The following extends Stein’s [27] lemma for computing the expectation 
of the inner product of X — 9 and a vector of functions of X. 

Lemma 1. Let X = (Xi,...,X p ) T be multivariate normal with mean 9 and variance 
matrix X. Assume that g= (gi,..., g p ) T :1Z P ^ 1Z P is almost differentiable Stein [27] 
with Eg{\X jgi(X)\} < oo for i,j = 1,... ,p, where Vj = d/dxj. Then 

Ee{(X - 9) T g(X)} = tr[X E e {X7g(X)}}, 

where V g(x) is the matrix with ( i,j)th element V jgi(x). 

Proof. A direct generalization of Lemma 2 in Stein [27] to a normal random vector with 
non-identity variance matrix gives 

E e {(X - 9)g t (X)} = EEj{Xg t (X)}, 

where V gi(x) is the row vector with j th element V jgi(x). Taking the ith element of both 
sides of the equation gives 


p 

Eo{(Xi - 9,) gi {X)} = Y, VijEeiVjgi (X)}, 
j =i 


where cr,; ? is the (i, j)th element of X. Summing both sides of the preceding equation over 
i gives the desired result. □ 
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Proof of Theorem 1. By direct calculation, the risk of 6 a, r is 

R(S A , r ,e) = tr(£Q) + - 2 E e {(X - }• 

By Lemma 1 and the fact that tr(Y,QAXX T A T QA) = X T A T QAXQAX , the third term 
after the minus sign in R(5A,r,0) is 


2 Eglr 


tr(EQA) } 
X T A T QAX J 


- AE e l r 


X t A t QAY,QAX\ / , X t A t QAT,QAX \ 

(X T A T QAX ) 2 J + 6 V X T A T QAX J' 


By condition (14), A T QA"EQA is nonnegative definite. By Section 21.14 and Exercise 
21.32 in Harville [18], (x T A T QAT,QAx) /(ir T A T QAx) < A max (dEQ + EH T Q) /2 for x ^ 0. 
Then the preceding expression is bounded from below by 

/ tr(EQd)-A max (HEQ + Ed T Qn 
2J H r - X^QAX -/' 


which leads immediately to the upper bound on R(SA,r,d). 


□ 


Proof for condition (17). We show that if condition (17) holds, then there exists a 
nonsingular matrix B with the claimed properties. The converse is trivially true. Let 
R be the unique symmetric, positive definite matrix such that R 2 = Q. Then RAR~ l 
is symmetric, that is, RAR~ X = R~ 1 A T R , because QA = A T Q. Moreover, RT,R and 
RAR -1 commute, that is, RAR~ 1 (RT,R) = iffii?(.RAR _1 ) T = RT,R(RAR~ 1 ), because 
HE = EA t and RAR~ l is symmetric. Therefore, RT,R and RAR 1 are simultaneously 
diagonalizable (Harville [18], Section 21.13). There exists an orthogonal matrix O such 
that 0(RT,R)O t = D and 0(RAR~ 1 )0 T = A* for some diagonal matrices D and A*. 
Then B = OR satisfies the claimed properties. □ 


Proof of inequality ( 20 ). We show that if (Zi,..., Z p ) are independent standard nor¬ 
mal variables, then E{(^ =1 a?Z ?)“ 1 } > {p/(p - 2 )}(^ =1 af) _1 - Let S = J2j =1 Z j- 
Then S and (Zf/S,...,Z p /S) are independent, S ~ Xp; and ( z i/S ,..., Z^/S) ~ 
Dirichlet(l/p,..., 1/p). The claimed inequality follows because Z{Q ^ =1 a:-Z|) _1 } = 
E{(E P j= I^ZJ/S)- 1 }^- 1 ), E(S~ 1 ) = l/(jp — 2 ), and E{{^U ap]/S)^}> 
(Sj=i tf/p)- 1 by Jensen’s inequality. □ 

Proofs of Theorem 2 and Corollary 2. Consider the transformation Sj = d 2 / (d :] + 7 j) 
and ctj = {(dj + 'Yj)/dj}aj, so that SjOtj = djOj and Sjaq = (dj + r )j)a 2 . Problem (22) is 

then transformed to maxoj. ap {Ej =1 ^j a i — 2max(Jiau,..., S p a p )}, subject to ctj > 0 

(j = 1 ,... ,p) and Y^ P j =1 ^j a j = Sj=i which is of the form of the special case of ( 22 ) 
with 7 j = 0 {j = 1,... ,p). But it is easy to verify that if the claimed results hold for the 
transformed problem, then the results hold for original problem (22). Therefore, assume 
in the rest of proof that = 0 (j = 1 ,... ,p). 
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There exists at least a solution, A\ to problem (22) by boundedness of the constraint 
set. Let JC = {k: dfcaj. = k = 1,... ,p} and JC C = {j: djOj < d„a\,,j = 1,... ,p}. A 
key of the proof is to exploit the fact that, by the setup of problem (22), (aj,... , aj) is 
automatically a solution to the problem 

p 

max > d 7 a,- — 2 d u a v , 

ai,...,a v 11 J J 

3 =1 

subject to cij > 0, djCLj < (j = 1,... ,p), and (A.l) 

v v 

'Yhd j a 2 j = '^dj. 
i=i i=i 

The Karush-Kuhn-Tucker condition for this problem gives 

- 1 + 2Aaj - d- l Pj = 0 forj G /C c , (A.2) 

-1 + 2Aaj + p k ~ d^pk = 0 for k (^ v) G /C, (A.3) 

— l + 2Aaj+^2 — Mfcj — d„ 1 p u = 0, (A. 4) 

^ fceK\W ' 

where A, pk > 0 (fc G K, \ {z'}), and pj > 0 satisfying p^aj =0 (j = 1,... ,p) are Lagrange 
multipliers. 

First, we show that a] > 0 and hence pj = 0 for j = 1,... ,p. If /C c = 0, then either 
dj > 0 for j = 1,... ,p, or a\ = ■ ■ • = aj = 0. The latter case is infeasible by the constraint 
d,,aj = dj- Suppose K, c ^ 0. By (A.2), a] > 0 for each j G /C c . Then aj. > 0 
for each k G /C because dfcaj, > djaj. 

Second, we show that ^ > 3. If /C c = 0, then v=p> 3. Suppose /C c ^ 0. Then A > 0 by 
(A.2). Summing (A.3) over fc (^ v) G /C and (A.4) shows that —i/ + 2A Y^k=\ aj + 2 = 0. 
Therefore, v > 2 or equivalently u>3. 

Third, we show that /C = {1,2,..., v} and /C c = {v + 1,... ,p}. For each k (^ i/) G /C and 
j G /C c , aj) < aj by (A.2)-(A.3) and then dfc > dj because dfcaj > djaj. The inequalities 
also hold for fc = z/, by application of the argument to problem (A.l) with v replaced by 
some k (^ v) G /C. Then K. c = {v + 1,... ,p} because d u > dj for each j G /C c , di > d 2 > 
■■■> d p . and z/ is the largest element in K.. 

Fourth, we show the expressions for (aj,..., aj) and the achieved maximum value. By 
the definition of /C, aj oc d^T 1 for k = By (A.2), a] oc 1 for j = v + 1,... ,p. Let 

= dj,aj, and = aj +1 . Then (y\z^) is a solution to the problem 

f p \ 

max (y — 2 )y + I dj j z, 

v,z \j=^+i J 
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subject to y > 0 , z > 0 , y>d v +iZ, and 






j=i 


By the definition of 1C, y 1 > d v +iz^ and hence (y',z') lies off the boundary in the 
constraint set. Then (y\z') is a solution to the foregoing problem with the constraint 
y > d u -f-iz removed. The problem is of the form of maximizing a linear function of ( y,z) 
subject to an elliptical constraint. Straightforward calculation shows that 

t fEU d A 1/2 v — 2 t (EU d >\ 1/2 
y V M v ) J2j=i dj 1 ’ 2 V M v J 

and the achieved maximum value is (Ej=i djY^M ^ 2 , where M„ = (i^ — 2) 2 /(5Z^y=i dj 1 ) 

+ Ej=„+i d j ■ 

Finally, we show that the sequence (M 3 , M 4 ,... M p ) is nonincreasing: M k > M k+ \, 
where the equality holds if and only if k — 2 = Ej =\ dk+i/dj. Because y t > d v +iz^ or 
^- 2 >E;=i d „+i/dj, this result implies that M„ > M u+ \ and hence A 1 is a unique 
solution to (22). Let L k = {(X^y=i djXEjLi dj" 1 ) - (k - 2) 2 }/ E * =1 dj 1 so that M k = 
YTj =1 d j ~ dj k . By the identity (b + 0) / (a + a) — b/a = (0/a — b/a){a/(a + a)} and simple 
calculation, 


L k +1 — L k 


E?-, (dj / d k +i + dk+i/dj) — 2k + 4 


d 


-1 

fc+i 


— d k +1 


{r k - (fc- 2)} 2 
r k (r k + 1 ) 



(k- 2) 2 

^UdJ 1 


*k +1 


E-i 1 ^ 1 


1 w J 

(A.5) 


where r k = Ej=i d k+ i/d r Therefore, L k < L k+ \. Moreover, L k = L k+ 1 if and only if 
r k = k- 2 , that is, Ej=i d k+ i/dj = k- 2. □ 


Proof of Corollary 3. It suffices to show (25). By Corollary 2, Efc=i d* v ld* k > v — 3 
and hence Efc=i d*/dj£ >v — 2. Then for v. 

t = ~~ 2)d*~ 1 dj < (v — 2)d*~ 1 < dj 

J ' Efc=i dfc 1 Efc=i dfc 1 dj+^j dj+7j 


because d* > d* for j < u. 


□ 


Proof of Theorem 3. Let L k = E j=i dj — (k — 2 ) 2 / Ej=i dj 1 so that M k = Ej=i dj — 
L k , similarly as in the proof of Theorem 2 . By equation (A.5) with r k = Ej=i d k+ i/d* 
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and dk+i replaced by d k+1 , 


Lv — L 3 + — L k ) — L 3 + dl. 


fe+i" 


{ rk ~ (fc - 2)} 2 

r-fe(r fc + 1 ) 


/c—3 /c—3 

By the relationship = (cZ^_|_ x /o?J)(l +rfc_i) and simple calculation, 

1 


£3 — d* + d* 2 + dg - - 


n 

u-l 

= d* 1 +d* 2 + d* 3 -'£d* k+1 


+ d% + dS 


k -3 


1 


1 


r-fc r fc + 1 


rv-i + 1 


If v > 4, combining the two preceding equation gives 

Lv =di+di+di+Y^di +1 {rk - [k ~ 2)}2 - 1 dl 


k =3 
v-1 


<di+d* 2 +d* 3 +j2d: 


rk{r k + 1 ) 


dl 


r v -1 +1 


fc +1 


k —3 


fc(fc + i) ^ 




k —3 


k + 1 


<dl+d*+d* 3 +d* 4 - 4-^. 

The first inequality follows because k — 2 < r k < fc for fc = 3,..., z/ — 1 
2 )} 2 /{t(t + 1 )} is increasing for k — 2 < t < k with a maximum at t = 
inequality follows because d\ > d 2 > ■ ■ ■ > d*. Therefore, if v > 4 then 




P -2 


P-2 


■3= 1 


53 ^- (di+t5+^+d2-4^ 




If v = 3, then £„ < d£ + d£ + d2j — dg/3 and hence 


^ 2 ^ > ^ | J2 d J - (dl + d* 2 + d* 3 - dg/3) j 

P / r, P 7 * 

-Vd*- I d*_?-Vd*_^ 

l d 3 2 2 ^dj 23 

J=3 \ ^ J=4 r 


and {t — (fc — 
fc. The second 
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This completes the proof. 


□ 
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Supplementary Material for “Improved minimax estimation of a multivariate 
normal mean under heteroscedasticity” (DOI: 10.3150/13-BEJ580SUPP; .pdf). We 
present additional results from the simulation study in Section 4. 
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